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This  paper  presents  a  parallel  version  of  the  Fast  Multipole  Method  (FMM).  The  FMM  is  a  re¬ 
cently  developed  scheme  for  the  evaluation  of  the  potential  and  force  fields  in  systems  of  particles 
whose  interactions  are  Coulombic  or  gravitational  in  nature.  The  sequential  method  requires  O(.V) 
operations  to  obtain  the  fields  due  to  N  charges  at  N  points,  rather  than  the  0(N2)  operations 
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as  log  Ar  when  using  N  processors.  Numerical  results  are  given  for  a  shared  memory  machine  (the 
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1.  Introduction 


Numerical  methods  for  computing  Ar-body  interactions  generally  fall  into  two  categories.  Con¬ 
tinuum  methods  are  based  on  the  fact  that  the  potential  satisfies  the  Poisson  equation  and  use  fast 
Poisson  solvers  to  obtain  the  field  [6].  They  are  hindered  by  the  limited  resolution  of  the  imposed 
grid  and  the  degradation  of  performance  seen  with  highly  non-homogeneous  distributions  of  parti¬ 
cles.  Hierarchical  methods  [l,  2]  are  based  on  the  fact  that  the  field  due  a  cluster  of  particles  can 
be  represented  at  a  great  distance  by  the  net  mass  acting  at  the  center  of  mass.  Tree  structures 
are  used  to  partition  space  and  group  particles  at  various  length  scales,  so  that  the  center  of  mass 
approximation  can  be  applied.  The  CPU  time  requirements  of  these  methods  generally  grow  as 
TV  log  N.  They  handle  non-homogeneous  distributions  better  than  the  continuum  methods,  but 
yield  only  approximate  results. 

The  Fast  Multipole  Method  (FMM)  [5,  3,  8,  7]  shares  certain  characteristics  with  the  hier¬ 
archical  soh'ers.  Tree  structures  are  imposed  to  partition  space,  and  the  strategy  is  similar,  but 
analytic  observations  concerning  multipole  and  Taylor  expansions  are  used  to  produce  results  that 
are  accurate  to  within  round-off  error.  The  CPU  time  requirements  are  of  the  order  0(N  log(l/e)), 
where  e  is  the  desired  accuracy. 

In  this  paper,  we  will  describe  a  parallel  version  of  the  non-adaptive  two-dimensional  FMM 
and  present  numerical  results  for  an  implementation  on  a  shared  memory  machine  (the  Encore 
Multimax  320).  We  note  that  Zhao  [9]  has  independently  developed  a  parallel  implementation  of 
a  non-adaptive  three-dimensional  multipole  method  for  the  Connection  Machine. 

2.  Mathematical  Preliminaries 

In  this  paper,  we  will  consider  as  a  model  the  TV-body  problem  in  the  complex  plane  C.  That 
is,  given  the  positions  z,  and  strengths  <?;  of  N  charged  particles,  we  wish  to  compute  the  net 
potential  <f>  and  electric  field  E  at  each  particle  position  from  Coulomb’s  law.  These  are  given  by 
the  expressions 

4>{zi)  =  £«  I  53  ’  Iog^‘  ~ 

W« 

and 

E(Zi)  =  {-RetfXzMJmitXzi)))  , 

respectively. 

Suppose  now  that  m  charges  with  strengths  q,  and  positions  z,  are  located  within  a  disk  of 
radius  r  centered  at  the  origin.  Then,  it  is  shown  in  [5],  that  for  a  point  z  with  |z]  >  r,  the  potential 
4>(z)  induced  by  the  charges  is  given  by  a  multipole  expansion  of  the  form 
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4>(z)  =  <?iog(z)  +  J3  pr  .  (2-i) 
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The  error  in  truncating  the  sum  after  s  terms  is 

4>(z)  -  Q  log(z)  "  E  7F  |  ^  (~l)  (7)  ’  (2-2) 


where 
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In  order  to  obtain  a  relative  precision  of  c  (with  respect  to  the  total  charge),  the  number  of 
terms  required  in  the  series  representation  of  <p  is  approximately  -  logc(c),  independent  of  m,  the 
number  of  source  charges.  The  Fast  Multipole  Method  is  based  on  making  explicit  use  of  this 
result. 

2.1.  Translation  operators 

In  the  FMM  scheme,  it  is  necessary  not  only  to  form  multipole  expansile  in  (2  1),  but  to 
carry  out  a  sequence  ot  analytic  transformations  of  the  expansion  coefficients.  These  transforma¬ 
tions  are  described  in  the  next  three  lemmas.  Detailed  proofs  can  be  found  in  [5].  The  first,  Lemma 

2.1,  provides  a  mechanism  for  shifting  the  center  of  a  multipole  expansion. 

Lemma  2.1.  (Translation  of  a  Multipole  Expansion)  Suppose  that 
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is  a  muitipole  expansion  of  the  potential  due  to  a  set  of  m  charges  of  strengths  q\,qi, . .  .,qm>  ad  of 
which  are  located  inside  the  circle  D  of  radius  R  with  center  at  z0.  Then  for  z  outside  the  circle 
D  j  of  radius  (R  +  |zo|)  and  center  at  the  origin, 

0°  . 

4>(z)  =  Golog(z)  +  ^2  -j-,  (2.5) 
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with  (l)  the  binomial  coefficients.  Furthermore,  for  any  s  >  1, 

<t>(z)  -  a0\og(z)  -  <  (^Tt)  (")  ’  (2J) 

where  A  is  defined  in  (2.3)  and 

z 

l^o !  +  R 

Lemma  2.2  describes  the  conversion  of  a  multipole  expansion  into  a  local  (Taylor)  expansion 
inside  a  circular  region  of  analyticity. 

Lemma  2.2.  (Conversion  of  a  Multipole  Expansion  into  a  Local  Expansion)  Suppose 
that  m  charges  of  strengths  9i,92»— »9m  are  located  inside  the  circle  D\  with  radius  R  and  center 
at  zo,  and  that  |z0|  >  (c  +  1)/?  with  c  >  1.  Then  the  corresponding  multipole  expansion  (2.4) 
converges  inside  the  circle  D 2  of  radius  R  centered  about  the  origin.  Inside  D^,  the  potential  due 
to  the  charges  is  described  by  a  power  series: 
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where 


and 
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(2.9) 
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Furthermore,  for  any  s  >  max 


^2,  an  error  bound  for  the  truncated  series  is  given  by 
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where  A  is  defined  in  (2.3)  and  e  is  the  base  of  natural  logarithms. 

Lemma  2.3  provides  a  formula  for  shifting  the  center  of  a  local  expansion  within  a  region  of 
analyticity.  This  translation  is  exact,  and  no  error  bound  is  needed. 

Lemma  2.3.  (Translation  of  a  Local  Expansion)  For  any  complex  zq,z  and  {a^},  k  = 

0,1,2  ,...,n, 


where 
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3.  Informal  Description  of  the  FMM 

In  this  section,  we  briefly  outline  the  sequential  FMM  procedure.  A  more  detailed  discussion 
is  available  in  [5,  7].  The  algorithm  uses  a  divide  and  conquer  strategy  to  cluster  particles  at 
various  levels  of  spatial  discretization,  and  then  uses  multipole  and  Taylor  expansions  to  evaluate 
the  interactions  between  distant  clusters.  Once  all  distant  interactions  are  accounted  for  by  this 
expansion  technique,  the  interactions  between  neighboring  particles  are  computed  by  the  direct 
application  of  the  pairwise  force  law. 

We  now  introduce  the  notation  necessary  for  a  description  of  the  algorithm.  Since  we  are 
considering  the  non-adaptive  scheme,  we  assume  that  N  charges  are  more  or  less  homogeneously 
distributed  within  a  square  with  sides  of  length  one,  and  refer  to  this  square  as  the  computational 
box.  We  impose  a  hierarchy  of  meshes  on  the  computational  box  which  refine  it  into  smaller  and 
smaller  regions.  More  specifically,  mesh  level  0  refers  to  the  entire  computational  box,  while  mesh 
level  /  +  1  is  obtained  recursively  from  level  l  by  subdividing  each  box  into  four  equal  parts.  A  tree 
structure  is  imposed  on  this  hierarchy,  so  that  if  ibox  is  a  box  at  level  /,  then  the  four  boxes  at  level 
l  +  1  obtained  by  its  subdivision  are  considered  its  children.  In  general,  the  maximum  number  of 
refinements  (the  tree  depth)  is  chosen  to  be  on  the  order  of  log^  N,  at  which  point  there  is  on  the 
order  of  1  particle  in  each  box  at  the  finest  level.  For  every  box  i  at  level  1,  we  define  the  nearest 
neighbors  to  be  the  box  itself  and  any  other  box  at  the  same  level  with  which  it  shares  a  boundary 
point.  There  are  clearly  at  most  9  nearest  neighbors. 

Two  boxes  (at  a  given  level)  with  sides  of  length  D,  are  said  to  be  well-separated  if  they  are 
separated  by  a  distance  D.  It  is  shown  in  [7]  that,  in  using  s-term  expansions  to  account  for 
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the  interactions  between  well-separated  boxes,  the  error  bounds  (2.2),  (2.7)  and  (2.11)  apply  with 
c  =  (4  -  \/2)/v/2  ss  1.8.  For  a  given  precision  c,  we  therefore  choose  s  =  [— logc(e)]. 

Both  multipole  and  local  expansions  are  associated  with  each  box.  $/y  is  the  5-term  multipole 
expansion  about  the  center  of  box  i  at  level  /  which  describes  the  far  field  potential  due  to  the 
particles  contained  inside  the  box.  41/,,  is  the  s-term  local  expansion  about  the  center  of  box  i 
at  level  /  which  describes  the  potential  field  due  to  all  particles  outside  the  box  and  its  nearest  1 

neighbors.  is  the  s-term  local  expansion  about  the  center  of  box  i  at  level  l  which  describes  the 
potential  field  due  to  all  particles  outside  i’s  parent  box  and  the  parent  box’s  nearest  neighbors. 

Finally,  an  interaction  list  is  associated  with  each  box  i  at  level  /.  This  is  the  set  of  boxes  which 
are  children  of  the  nearest  neighbors  of  i’s  parent  and  which  are  well-separated  from  box  i. 

The  algorithm  computes  interactions  between  groups  of  particles  at  the  coarsest  possible  mesh 
level.  Two  passes  are  executed. 

Initialization  Choose  a  level  of  refinement  n  [log4  N],  a  precision  e,  and  set  s  =  [— log.(f)] . 

Assign  particles  to  boxes  at  finest  mesh  level. 

Upward  Pass 

Step  1:  Form  multipole  expansion  about  the  box  center  for  each  box  i  at  finest  mesh  level. 

Uses  equation  (2.1). 

Step  2:  Recursively  form  multipole  expansions  about  the  centers  of  all  boxes  at  all  coarser  mesh 
levels,  each  expansion  representing  the  potential  field  due  to  all  particles  contained  in  the  box.  Uses 
Lemma  2.1. 


In  the  downward  pass,  the  local  expansions  are  formed  for  all  boxes,  beginning  at  the 
coarsest  level.  This  process  is  somewhat  more  complex.  Suppose,  however,  that  at  level  /  -  1,  the 
local  expansion  has  been  computed.  Then  Lemma  2.3  can  be  used  to  shift  the  expansion 

to  each  of  the  box’s  children.  For  each  child  box  j  at  level  /,  what  w'e  have  obtained  is  a  local 
representation  of  the  field  due  to  all  particles  outside  the  parent’s  nearest  neighbors,  namely  4'/,,. 
The  interaction  list  defined  above  is  precisely  the  set  of  boxes  whose  contribution  to  the  potential 
must  be  added  to  ty/y  to  create  ty/,,-.  The  initialization  of  this  pass  is  simple.  Since  there  are  no 
well-separated  boxes  at  level  0  or  1,  we  may  set  $0,t,  ’J’l.t,  ^l.i  and  to  zero. 

Downward  Pass 


Steps  3,4:  Begin  at  level  2,  and  proceed  to  finer  levels  as  follows:  form  by  using  Lemma  2.2 

to  convert  the  multipole  expansion  of  each  box  j  in  the  interaction  list  of  box  ibox  to  a  local 
expansion  about  the  center  of  box  ibox,  adding  these  local  expansions  together,  and  adding  the 
result  to  Vi,ibox-  If  finest  level  has  been  reached,  process  is  complete.  Otherwise  form  the  expansion 
Vl+i,j  for  ibox' s  children  by  using  Lemma  2.3  to  expand  Vi^ox  about  the  children’s  box  centers 
and  continue  procedure. 

Step  5:  Evaluate  local  expansions  at  particle  positions  to  obtain  the  far-field  potential  and/or  force. 
Step  6:  Compute  potential  (or  force)  due  to  particles  in  nearest  neighbor  boxes  directly. 

Step  7:  For  every  particle,  add  direct  and  far-field  terms  together. 


4.  Description  of  the  parallel  algorithm 

The  algorithm  described  in  the  previous  section  has  several  opportunities  for  parallelism.  The 
best  of  these  opportunities  are  the  completely  parallel  operations  such  as  the  computation  of  the 
initial  moments  and  the  evaluation  of  the  local  expansions  for  each  particle.  The  other  parallel 
operations  require  some  coordination  between  processors.  For  example,  evaluation  of  the  forces 
between  particles  in  neighboring  boxes  in  step  6  requires  secure  data  access  if  Newton's  third  law 


N 

Talg 

P  =  1 

Tdir 

P  =  1 

Talg 
p-  16 

Tdir 

p  =  16 

Speedup 

Alg 

Speedup 

Dir 

625 

14 

54 

1.2 

3.45 

11.7 

15.7 

1250 

52 

216 

3.6 

13.9 

14.4 

15.5 

2500 

68 

872 

4.6 

54.9 

14.7 

15.9 

5000 

235 

3490 

15.5 

220.8 

15.2 

15.9 

10000 

301 

14020 

19.7 

910.4 

15.3 

15.4 

20000 

1008 

56385 

65.0 

3560.4 

15.5 

15.8 

Table  1:  Table  of  times  for  algorithm  (alg)  and  direct  method 
(dir)  on  Encore  Multimax  320.  All  times  in  seconds. 
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Figure  1:  Speedup  for  the  calculation  of  the  far-neld  by  [  n  .  I 

the  Fast  Multipole  Method.  The  labels  are  the  number  of  f-\ - 

particles. 


is  used.  However,  it  is  the  reductions,  the  communications  between  mesh  levels,  that  cause  the 
greatest  difficulty. 

The  fact  that  the  entire  program  is  not  completely  parallel  opens  the  question  of  how  efficient 
the  algorithm  can  be,  particularly  on  a  large  number  of  processors.  We  will  adcliess  this  by  analyzing 
the  computational  complexity  of  the  parallel  algorithm.  We  do  not  discuss  the  initialization  of  the 


p 


Figure  2:  Speedup  for  the  near-held  calculation  (step  6). 
The  labels  are  the  number  of  particles. 


algorithm  since  the  initialization  is  essentially  a  parallel  sort  and  is  performed  only  at  the  beginning 
of  a  compulation.  Since  a  common  use  of  the  Fast  Multipole  Method  is  to  compute  the  forces  at 
each  time  step  in  a  dynamical  simulation,  this  initial  sort  may  be  amortized  over  all  the  time  steps. 
Further,  in  a  time  dependent  calculation,  it  is  possible  to  exploit  slowly  varying  changes  in  the 
potential  to  reduce  the  amount  of  computation;  we  will  not  consider  this  effect  either. 

For  the  actual  implementation,  we  can  consider  each  of  the  steps  in  the  algorithm  separately. 
We  use  the  term  “communication”  to  denote  any  coordination  between  processors.  In  a  message 
passing  system,  this  would  be  a  message;  in  a  shared  memory  system,  this  would  be  some  critical 
section  (e.g.,  a  spinlock).  We  use  N  to  denote  the  number  of  particles,  n  the  number  of  levels,  and 
p  the  number  of  processors.  Let  B  denote  the  average  number  of  particles  per  box  at  the  finest 
level.  Then  N  =  BAn. 


Upward  Pass 

Step  1:  Formation  of  expansions  at  finest  level.  There  is  no  communication;  the  complexity  is  N/p. 
Step  2:  Merge  upward.  Communication  is  within  box  and  with  parent  box.  The  complexity  is 
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Figure  3:  Parallel  efficiency  for  the  full  algorithm.  The 
labels  are  the  number  of  particles. 


times  the  cost  per  box.  This  is  one  of  the  most  important  terms  because  it  gives  a  limit  on  the 
available  parallelism.  The  second  sum  shows  the  bottleneck  in  the  reduction:  when  there  are  fewer 
than  p  boxes,  some  processors  go  idle.  This  bottleneck  is  an  essential  part  of  the  algorithm. 
Downward  pass 

Steps  3,4:  Convert  the  multipole  expansions  into  local  expansions  and  move  down.  The  complexity 
is  of  the  same  form  as  above  (but  with  a  different  constant). 

Step  5:  Evaluation  of  local  expansions  at  the  finest  level.  This  is  perfectly  parallel  and  has  com¬ 
plexity  N/p. 

Step  6:  Compute  the  potential  or  force  due  to  particles  in  neighboring  boxes  directly.  This  involves 
the  neighboring  boxes  through  direct  interaction.  There  is  no  “reduction”  overhead;  however, 
there  is  some  communication  due  to  the  fact  that  the  field  at  a  particle  position  may  be  updated 
by  several  adjacent  processors/particles.  The  complexity  is  roughly  4 nB2 /p  =  N B/p.  If  Newton’s 
third  law  is  not  used,  this  is  completely  parallel  (i.e.,  we  can  eliminate  any  possibility  of  memory 
contention  at  the  cost  of  twice  as  much  arithmetic). 

Step  7:  Add  the  components  (direct  and  far-ficld)  together  at  each  particle.  This  is  perfectly 
parallel;  the  complexity  is  N/p. 
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Summing  the  contributions  from  each  step,  the  overall  complexity  is 


m  aN  ,,  cN  dNB  ,  „  x 

T  =  - +b  log4  p  +  —  +  — —  +  e(N ,  p, 

P  Bp  p 


where  a,  b ,  c,  and  d  are  constants  determined  by  the  floating  point  speed  and  the  requested  preci¬ 
sion,  and  e  is  a  lower  order  term  which  includes  things  like  the  communication  or  synchronization 
overhead. 

One  important  observation  to  be  made  concerns  the  choice  of  the  parameter  B  which  deter¬ 
mines  the  number  of  refinement  levels  n.  For  the  uni-processor  case,  this  is  discussed  in  [3],  In 
the  parallel  case,  there  is  a  temptation  to  use  B  =  1.  While  a  value  for  B  of  1  has  the  advantage 
of  placing  (at  most)  one  particle  per  processor,  this  can  Tesult  in  a  serious  loss  in  efficiency.  The 
optimal  value  of  B,  denoted  Bopt,  for  a  given  N  and  p,  is  the  one  that  minimizes  the  time  T  in 
(4.1): 

dT  _  _cN_  dN_  _ 

dB  B2p  p 


While  the  exact  values  for  c  and  d  are  difficult  to  determine  (they  depend  on  the  floating  point 
rates,  memory  speed,  and  details  of  the  coding  and  the  compiler  used),  c  is  roughly  proportional 
to  25s2,  where  s  is  the  number  of  terms  in  the  expansion,  while  d  is  roughly  9,  the  number  of 
neighboring  boxes.  The  optimal  blocking  is  therefore  Bopt  %  2s.  For  single  precision  accuracy, 
s  ss  15,  so  that  Bopt  »  30  and  the  execution  time  for  B  =  1  is  about  15  times  slower.  Note  that 
Bopt  is  independent  of  the  number  of  processors,  and  should  be  used  for  both  parallel  and  sequential 
implementations. 

It  is  clear  from  (4.1)  that  with  O(N)  processors,  the  overall  complexity  is  C>(logA').  The 
parallel  efficiency  can  be  estimated  as  the  ratio  of  the  time  on  a  single  processor  (without  the 
overhead)  to  p  times  the  time  on  p  processors: 


efficiency  = 


p(-T  +  61og4p  +  e(Ar,p)) 


1  +  loS*i  V  + 


.  pb  ,  pt 

1  “  aAT  °g4P~  oTV’ 


where  a  =  a  c/ B  +  dB. 

It  is  interesting  to  look  at  the  behavior  of  equation  (4.3)  for  the  cases  B  =  1  and  for  B  =  Bop(, 
with  p  =  N/B  (i.e.,  one  box  at  the  finest  level  per  processor).  We  do  not  consider  using  more  than 
one  processor  per  box.  In  the  first  case,  the  efficiency  is  roughly 


1  "  a  l0g4  N 


where  we  have  neglected  the  overhead  term  e(p,N).  With  the  optimal  B,  the  efficiency  is 


f»V jV! 


To  give  an  idea  of  what  this  means,  assume  that  the  B  =  1  case  gjves  90%  efficiency,  and  let 
r?  ,pt  =  10  (this  is  quite  conservative).  Then  the  choice  B  =  Bopt  will  achieve  more  than  99% 
efficiency.  Moreover,  if  the  B  =  1  case  gives  10%.  efficiency,  the  choice  B  —  Bopt  will  achieve  more 
than  90%,  efficiency  (assuming  that  the  parallel  overhead  terms  e(p,N)  are  negligible). 

The  above  analysis  gets  even  worse  if  we  consider  the  case  of  true  speedup,  defined  as 

Time  for  best  uni-processor  algorithm 
Time  for  parallel  algorithm 

In  this  case,  the  efficiency  for  B  =  1  is  roughly  2jBopt  times  the  formula  in  equation  (4.3).  In 
short,  for  good  processor  utilization,  an  implementation  of  the  Fast  Multipole  Method  should  use 
the  optimal  blocking  factor  Bopt. 

Our  last  consideration  is  given  to  minimizing  the  total  time  T ,  given  a  fixed  number  of  particles 
N  but  varying  the  number  of  processors  p.  In  this  case,  the  minimum  time  depends  in  a  complicated 
way  on  the  various  parameters,  but  is  achieved  at  roughly  p  =  AT .  This  can  be  seen  from  the 
efficiency  figures  above.  For  example,  if  the  case  B  =  1  gave  90%  parallel  efficiency  relative  to  a 
sequential  program  with  B  —  1,  it  would  give  18%  efficiency  relative  to  a  sequential  program  with 
Bopt  =  10-  Each  processor  is  busy  roughly  one  fifth  of  the  time.  However,  there  are  10  times  as 
many  processors  for  B  =  1  as  for  B  =  10,  so  the  total  time  for  B  =  1  and  p  =  A’  is  roughly  half 
that  for  B  =  10  an  1  p  =  N/B.  Thus,  even  with  the  loss  of  efficiency,  the  5  =  1  case  has  a  smaller 
absolute  time.  This  fact  is  of  interest  only  when  there  are  enormous  numbers  of  processors  available, 
and  when  it  is  impractical  to  use  the  excess  processors  to  work  together  on  the  computations  for 
each  box. 

4.1.  Comments  on  the  parallel  implementation 

Our  parallel  implementation  is  based  on  a  version  of  the  serial  code  described  in  [5].  The  implemen¬ 
tation  is  a  “minimum  distance”  change,  and  does  not  attempt  to  rearrange  the  computation  to  be 
more  parallelizable.  In  particular,  it  is  possible  to  identify  subclasses  of  boxes  for  which  completely 
parallel  operations  may  be  performed;  within  these  classes  it  can  be  proven  that  no  data-access 
conflicts  can  occur  [4].  This  can  reduce  the  overhead  in  steps  2,  3,  and  4  by  reducing  the  number 
of  memory  locks  (in  a  shared  memory  implementation)  or  the  number  of  messages  (in  a  message 
passing  implementation). 

5.  Experimental  Results 

These  results  are  for  the  non-adaptive  algorithm,  and  do  not  include  the  work  of  the  initial  sorting 
of  the  particles.  The  data  in  Table  1  were  obtained  on  an  Encore  Multimax  320  with  18  processors. 
There  are  two  important  points  to  remember  in  interpreting  these  times.  One  is  that  they  were  taken 
on  a  time  sharing  system;  even  though  no  other  users  were  present,  various  daemons  will  consume 
some  resources.  To  reduce  this  effect,  we  used  only  16  of  the  18  processors  in  our  experiments. 
The  second  is  the  effect  of  the  choice  of  number  of  levels  While  the  complexity  estimates  predict 
time  linear  in  the  number  of  particles,  in  fact  the  actual  times  display  a  “ratchet”  behavior  as  the 
number  of  levels  increase.  However,  over  a  large  enough  range  of  number  of  particles,  the  behavior 
is  linear. 

Figures  1-3  show  a  breakdown  of  the  results  for  the  Encore  Multimax  320.  Figure  1  shows  the 
speedup  for  the  calculation  of  the  far  field  by  the  Fast  Multipole  Method.  Note  that  the  results 
are  clustered  into  four  groups;  these  represent  the  number  of  levels  (4-7).  The  speedup  is  lower 
when  the  number  of  processors  is  not  a  power  of  four,  a  result  of  the  poor  load-balancing  in  the 
reduction  stages  (steps  2-4).  Figure  2  shows  the  speedup  for  the  calculation  of  the  near  field  in 
step  6.  The  deviation  from  perfect  speedup  is  due  mainly  to  the  overhead  connected  with  secure 
access  to  data  (critical  sections).  Figure  3  shows  the  overall  efficiency.  Note  that  even  for  small 


rumbers  of  particles  75%  efficiency  is  achieved  and  for  5000  or  more  particles,  95%  efficiency  is 
achieved. 

A  version  of  the  three-dimensional  multipole  method  has  been  implemented  and  studied  by 
Zhao  [9].  His  results  show  the  predicted  log  A'  growth  as  N  ran  from  64  to  16384.  His  timings,  done 
on  the  Connection  Machine,  are  somewhat  slow.  Different  formulations  of  the  algorithm  presented 
here,  in  particular  with  respect  to  constant  terms  or  terms  in  —  logc,  should  significantly  reduce 
the  timings. 

6.  Conclusions 

Our  results  have  shown  that  the  Fast  Multipole  Method  is  very  suitable  for  shared  memory  parallel 
computers.  Both  our  experience  and  the  results  of  Zhao  indicated  that  it  is  suitable  for  message 
passing  parallel  computers  as  well.  The  overall  complexity  of  the  A'-body  calculation  (with  p  =  A' 
processors)  is  log  A’;  for  fixed  N  it  is  A  /p  -t-  log/). 

The  non-adaptive  algorithm  described  here  has  very  regular  memory  access  or  communication 
patterns  which  can  be  exploited  to  reduce  the  parallel  overhead.  Many  of  these  are  intrinsic  to  the 
Fast  Multipole  Method  itself,  and  should  be  exploitable  by  the  adaptive  version. 
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